[presto][iceberg] Wire dataSequenceNumber through protocol layer for equality delete conflict resolution#27392
Closed
apurva-meta wants to merge 2 commits intoprestodb:masterfrom
Closed
[presto][iceberg] Wire dataSequenceNumber through protocol layer for equality delete conflict resolution#27392apurva-meta wants to merge 2 commits intoprestodb:masterfrom
apurva-meta wants to merge 2 commits intoprestodb:masterfrom
Conversation
Contributor
Reviewer's GuideWires Iceberg delete file dataSequenceNumber through the Presto Java protocol into the C++ Velox connector, extends the protocol serialization, and propagates it into HiveIcebergSplit so Velox can perform server-side sequence-number-based delete conflict resolution for equality deletes, while also adding the missing equality-delete FileContent mapping and minor const/enum style cleanups. Sequence diagram for wiring dataSequenceNumber through Iceberg split handlingsequenceDiagram
actor Coordinator
participant PrestoJava as PrestoJava_IcebergConnector
participant DeleteFileJava as DeleteFileJava
participant ProtoDeleteFile as DeleteFileProtocol
participant Connector as IcebergPrestoToVeloxConnector
participant HiveSplit as HiveIcebergSplit
participant Reader as IcebergSplitReader
Coordinator->>PrestoJava: Plan query with Iceberg table
PrestoJava->>DeleteFileJava: fromIceberg(org.apache.iceberg.DeleteFile)
DeleteFileJava-->>PrestoJava: DeleteFileJava{dataSequenceNumber}
PrestoJava->>ProtoDeleteFile: Serialize DeleteFileJava to JSON including dataSequenceNumber
ProtoDeleteFile-->>Connector: Deserialize JSON to DeleteFileProtocol{dataSequenceNumber}
Connector->>Connector: toVeloxFileContent(FileContent.EQUALITY_DELETES) -> kEqualityDeletes
Connector->>Connector: Read icebergSplit.dataSequenceNumber into local dataSequenceNumber
Connector->>HiveSplit: Construct HiveIcebergSplit(..., deletes{DeleteFileProtocol.dataSequenceNumber}, infoColumns{"$data_sequence_number"=dataSequenceNumber}, ..., dataSequenceNumber)
HiveSplit-->>Reader: Provide dataSequenceNumber and deleteFiles for conflict resolution
Reader->>Reader: Apply equality delete conflict rules using dataSequenceNumber
Updated class diagram for Iceberg delete file dataSequenceNumber propagationclassDiagram
class DeleteFileJava {
- FileContent content
- String path
- FileFormat format
- int specId
- long recordCount
- long fileSizeInBytes
- List~Integer~ equalityFieldIds
- Map~Integer, byte[]~ lowerBounds
- Map~Integer, byte[]~ upperBounds
- long dataSequenceNumber
+ fromIceberg(deleteFile)
+ getContent() FileContent
+ getPath() String
+ getFormat() FileFormat
+ getSpecId() int
+ getRecordCount() long
+ getFileSizeInBytes() long
+ getEqualityFieldIds() List~Integer~
+ getLowerBounds() Map~Integer, byte[]~
+ getUpperBounds() Map~Integer, byte[]~
+ getDataSequenceNumber() long
+ toString() String
}
class FileContentEnum {
<<enum>>
DATA
POSITION_DELETES
EQUALITY_DELETES
}
class DeleteFileProtocol {
+ List~Integer~ equalityFieldIds
+ Map~Integer, String~ lowerBounds
+ Map~Integer, String~ upperBounds
+ int64_t dataSequenceNumber
+ to_json(j, p)
+ from_json(j, p)
}
class IcebergPrestoToVeloxConnector {
+ toVeloxFileContent(content) FileContent
+ toVeloxSplit(connectorSplit, splitContext) unique_ptr~HiveIcebergSplit~
}
class HiveIcebergSplit {
+ HiveIcebergSplit(
...,
vector~VeloxDeleteFile~ deleteFiles,
unordered_map~string, string~ infoColumns,
optional~string~ deletionVectorPath,
int64_t dataSequenceNumber
)
}
class VeloxDeleteFile {
+ VeloxDeleteFile(
FileContent content,
string path,
FileFormat format,
int32_t specId,
int64_t recordCount,
int64_t fileSizeInBytes,
vector~int32_t~ equalityFieldIds,
map~int32_t, string~ lowerBounds,
map~int32_t, string~ upperBounds,
int64_t dataSequenceNumber
)
}
DeleteFileJava --> FileContentEnum : uses
DeleteFileProtocol --> FileContentEnum : uses
IcebergPrestoToVeloxConnector --> DeleteFileProtocol : reads delete metadata
IcebergPrestoToVeloxConnector --> VeloxDeleteFile : constructs with dataSequenceNumber
IcebergPrestoToVeloxConnector --> HiveIcebergSplit : constructs with dataSequenceNumber
HiveIcebergSplit --> VeloxDeleteFile : aggregates deleteFiles
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
9c1d066 to
84e22f4
Compare
84e22f4 to
dccf95e
Compare
apurva-meta
added a commit
to apurva-meta/presto
that referenced
this pull request
Mar 21, 2026
… for equality delete conflict resolution (prestodb#27392) Summary: Wire the dataSequenceNumber field from the Java Presto protocol to the C++ Velox connector layer, enabling server-side sequence number conflict resolution for equality delete files. Changes: - Add dataSequenceNumber field to IcebergSplit protocol (Java + C++) - Parse dataSequenceNumber in IcebergPrestoToVeloxConnector and pass it through HiveIcebergSplit to IcebergSplitReader - Add const qualifiers to local variables for code clarity == RELEASE NOTES == General Changes * Upgrade Apache Iceberg library from 1.10.0 to 1.10.1. Hive Connector Changes * Add Iceberg V3 deletion vector (DV) support using Puffin-encoded roaring�bitmaps, including a DV reader, writer, page sink, and compaction procedure. * Add Iceberg equality delete file reader with sequence number conflict�resolution per the Iceberg V2+ spec: equality deletes skip when�deleteFileSeqNum <= dataFileSeqNum; positional deletes and DVs skip when�deleteFileSeqNum < dataFileSeqNum; sequence number 0 (V1 legacy) never skips. * Wire dataSequenceNumber through the Presto protocol layer (Java → C++)�to enable server-side sequence number conflict resolution for all delete�file types. * Add PUFFIN file format support for deletion vector discovery, enabling�the coordinator to locate DV files during split creation. * Add Iceberg V3 deletion vector write path with DV page sink and�rewrite_delete_files compaction procedure for DV maintenance. * Add nanosecond timestamp (TIMESTAMP_NANO) type support for Iceberg V3�tables. * Add Variant type support for Iceberg V3, enabling semi-structured data�columns in Iceberg tables. * Eagerly collect delete files during split creation with improved logging�for easier debugging of Iceberg delete file resolution. * Improve IcebergSplitReader error handling and fix test file handle leaks. * Add end-to-end integration tests for Iceberg V3 covering snapshot�lifecycle (INSERT, DELETE with equality/positional/DV deletes, UPDATE,�MERGE, time-travel) and all 99 TPC-DS queries. Differential Revision: D97531547
cdd01c2 to
bc2c773
Compare
apurva-meta
added a commit
to apurva-meta/presto
that referenced
this pull request
Mar 21, 2026
… for equality delete conflict resolution (prestodb#27392) Summary: Wire the dataSequenceNumber field from the Java Presto protocol to the C++ Velox connector layer, enabling server-side sequence number conflict resolution for equality delete files. Changes: - Add dataSequenceNumber field to IcebergSplit protocol (Java + C++) - Parse dataSequenceNumber in IcebergPrestoToVeloxConnector and pass it through HiveIcebergSplit to IcebergSplitReader - Add const qualifiers to local variables for code clarity == RELEASE NOTES == General Changes * Upgrade Apache Iceberg library from 1.10.0 to 1.10.1. Hive Connector Changes * Add Iceberg V3 deletion vector (DV) support using Puffin-encoded roaring�bitmaps, including a DV reader, writer, page sink, and compaction procedure. * Add Iceberg equality delete file reader with sequence number conflict�resolution per the Iceberg V2+ spec: equality deletes skip when�deleteFileSeqNum <= dataFileSeqNum; positional deletes and DVs skip when�deleteFileSeqNum < dataFileSeqNum; sequence number 0 (V1 legacy) never skips. * Wire dataSequenceNumber through the Presto protocol layer (Java → C++)�to enable server-side sequence number conflict resolution for all delete�file types. * Add PUFFIN file format support for deletion vector discovery, enabling�the coordinator to locate DV files during split creation. * Add Iceberg V3 deletion vector write path with DV page sink and�rewrite_delete_files compaction procedure for DV maintenance. * Add nanosecond timestamp (TIMESTAMP_NANO) type support for Iceberg V3�tables. * Add Variant type support for Iceberg V3, enabling semi-structured data�columns in Iceberg tables. * Eagerly collect delete files during split creation with improved logging�for easier debugging of Iceberg delete file resolution. * Improve IcebergSplitReader error handling and fix test file handle leaks. * Add end-to-end integration tests for Iceberg V3 covering snapshot�lifecycle (INSERT, DELETE with equality/positional/DV deletes, UPDATE,�MERGE, time-travel) and all 99 TPC-DS queries. Differential Revision: D97531547
bc2c773 to
d5dc345
Compare
apurva-meta
added a commit
to apurva-meta/presto
that referenced
this pull request
Mar 21, 2026
… for equality delete conflict resolution (prestodb#27392) Summary: Wire the dataSequenceNumber field from the Java Presto protocol to the C++ Velox connector layer, enabling server-side sequence number conflict resolution for equality delete files. Changes: - Add dataSequenceNumber field to IcebergSplit protocol (Java + C++) - Parse dataSequenceNumber in IcebergPrestoToVeloxConnector and pass it through HiveIcebergSplit to IcebergSplitReader - Add const qualifiers to local variables for code clarity == RELEASE NOTES == General Changes * Upgrade Apache Iceberg library from 1.10.0 to 1.10.1. Hive Connector Changes * Add Iceberg V3 deletion vector (DV) support using Puffin-encoded roaring�bitmaps, including a DV reader, writer, page sink, and compaction procedure. * Add Iceberg equality delete file reader with sequence number conflict�resolution per the Iceberg V2+ spec: equality deletes skip when�deleteFileSeqNum <= dataFileSeqNum; positional deletes and DVs skip when�deleteFileSeqNum < dataFileSeqNum; sequence number 0 (V1 legacy) never skips. * Wire dataSequenceNumber through the Presto protocol layer (Java → C++)�to enable server-side sequence number conflict resolution for all delete�file types. * Add PUFFIN file format support for deletion vector discovery, enabling�the coordinator to locate DV files during split creation. * Add Iceberg V3 deletion vector write path with DV page sink and�rewrite_delete_files compaction procedure for DV maintenance. * Add nanosecond timestamp (TIMESTAMP_NANO) type support for Iceberg V3�tables. * Add Variant type support for Iceberg V3, enabling semi-structured data�columns in Iceberg tables. * Eagerly collect delete files during split creation with improved logging�for easier debugging of Iceberg delete file resolution. * Improve IcebergSplitReader error handling and fix test file handle leaks. * Add end-to-end integration tests for Iceberg V3 covering snapshot�lifecycle (INSERT, DELETE with equality/positional/DV deletes, UPDATE,�MERGE, time-travel) and all 99 TPC-DS queries. Differential Revision: D97531547
d5dc345 to
d8cf575
Compare
This was referenced Mar 23, 2026
…tensibility Summary: - Reformat FileContent enum in presto_protocol_iceberg.h from single-line to multi-line for better readability and future extension. - Add blank line for visual separation before infoColumns initialization. Protocol files are auto-generated from Java sources via chevron. The manual edits here mirror what the generator would produce once the Java changes are landed and the protocol is regenerated. Differential Revision: D97531548
apurva-meta
added a commit
to apurva-meta/presto
that referenced
this pull request
Mar 27, 2026
… for equality delete conflict resolution (prestodb#27392) Summary: Wire the dataSequenceNumber field from the Java Presto protocol to the C++ Velox connector layer, enabling server-side sequence number conflict resolution for equality delete files. Changes: - Add dataSequenceNumber field to IcebergSplit protocol (Java + C++) - Parse dataSequenceNumber in IcebergPrestoToVeloxConnector and pass it through HiveIcebergSplit to IcebergSplitReader - Add const qualifiers to local variables for code clarity == RELEASE NOTES == General Changes * Upgrade Apache Iceberg library from 1.10.0 to 1.10.1. Hive Connector Changes * Add Iceberg V3 deletion vector (DV) support using Puffin-encoded roaring�bitmaps, including a DV reader, writer, page sink, and compaction procedure. * Add Iceberg equality delete file reader with sequence number conflict�resolution per the Iceberg V2+ spec: equality deletes skip when�deleteFileSeqNum <= dataFileSeqNum; positional deletes and DVs skip when�deleteFileSeqNum < dataFileSeqNum; sequence number 0 (V1 legacy) never skips. * Wire dataSequenceNumber through the Presto protocol layer (Java → C++)�to enable server-side sequence number conflict resolution for all delete�file types. * Add PUFFIN file format support for deletion vector discovery, enabling�the coordinator to locate DV files during split creation. * Add Iceberg V3 deletion vector write path with DV page sink and�rewrite_delete_files compaction procedure for DV maintenance. * Add nanosecond timestamp (TIMESTAMP_NANO) type support for Iceberg V3�tables. * Add Variant type support for Iceberg V3, enabling semi-structured data�columns in Iceberg tables. * Eagerly collect delete files during split creation with improved logging�for easier debugging of Iceberg delete file resolution. * Improve IcebergSplitReader error handling and fix test file handle leaks. * Add end-to-end integration tests for Iceberg V3 covering snapshot�lifecycle (INSERT, DELETE with equality/positional/DV deletes, UPDATE,�MERGE, time-travel) and all 99 TPC-DS queries. Differential Revision: D97531547
d8cf575 to
69bf631
Compare
apurva-meta
added a commit
to apurva-meta/presto
that referenced
this pull request
Mar 27, 2026
… for equality delete conflict resolution (prestodb#27392) Summary: Wire the dataSequenceNumber field from the Java Presto protocol to the C++ Velox connector layer, enabling server-side sequence number conflict resolution for equality delete files. Changes: - Add dataSequenceNumber field to IcebergSplit protocol (Java + C++) - Parse dataSequenceNumber in IcebergPrestoToVeloxConnector and pass it through HiveIcebergSplit to IcebergSplitReader - Add const qualifiers to local variables for code clarity == RELEASE NOTES == General Changes * Upgrade Apache Iceberg library from 1.10.0 to 1.10.1. Hive Connector Changes * Add Iceberg V3 deletion vector (DV) support using Puffin-encoded roaring�bitmaps, including a DV reader, writer, page sink, and compaction procedure. * Add Iceberg equality delete file reader with sequence number conflict�resolution per the Iceberg V2+ spec: equality deletes skip when�deleteFileSeqNum <= dataFileSeqNum; positional deletes and DVs skip when�deleteFileSeqNum < dataFileSeqNum; sequence number 0 (V1 legacy) never skips. * Wire dataSequenceNumber through the Presto protocol layer (Java → C++)�to enable server-side sequence number conflict resolution for all delete�file types. * Add PUFFIN file format support for deletion vector discovery, enabling�the coordinator to locate DV files during split creation. * Add Iceberg V3 deletion vector write path with DV page sink and�rewrite_delete_files compaction procedure for DV maintenance. * Add nanosecond timestamp (TIMESTAMP_NANO) type support for Iceberg V3�tables. * Add Variant type support for Iceberg V3, enabling semi-structured data�columns in Iceberg tables. * Eagerly collect delete files during split creation with improved logging�for easier debugging of Iceberg delete file resolution. * Improve IcebergSplitReader error handling and fix test file handle leaks. * Add end-to-end integration tests for Iceberg V3 covering snapshot�lifecycle (INSERT, DELETE with equality/positional/DV deletes, UPDATE,�MERGE, time-travel) and all 99 TPC-DS queries. Differential Revision: D97531547
apurva-meta
added a commit
to apurva-meta/presto
that referenced
this pull request
Mar 27, 2026
… for equality delete conflict resolution (prestodb#27392) Summary: Wire the dataSequenceNumber field from the Java Presto protocol to the C++ Velox connector layer, enabling server-side sequence number conflict resolution for equality delete files. Changes: - Add dataSequenceNumber field to IcebergSplit protocol (Java + C++) - Parse dataSequenceNumber in IcebergPrestoToVeloxConnector and pass it through HiveIcebergSplit to IcebergSplitReader - Add const qualifiers to local variables for code clarity == RELEASE NOTES == General Changes * Upgrade Apache Iceberg library from 1.10.0 to 1.10.1. Hive Connector Changes * Add Iceberg V3 deletion vector (DV) support using Puffin-encoded roaring�bitmaps, including a DV reader, writer, page sink, and compaction procedure. * Add Iceberg equality delete file reader with sequence number conflict�resolution per the Iceberg V2+ spec: equality deletes skip when�deleteFileSeqNum <= dataFileSeqNum; positional deletes and DVs skip when�deleteFileSeqNum < dataFileSeqNum; sequence number 0 (V1 legacy) never skips. * Wire dataSequenceNumber through the Presto protocol layer (Java → C++)�to enable server-side sequence number conflict resolution for all delete�file types. * Add PUFFIN file format support for deletion vector discovery, enabling�the coordinator to locate DV files during split creation. * Add Iceberg V3 deletion vector write path with DV page sink and�rewrite_delete_files compaction procedure for DV maintenance. * Add nanosecond timestamp (TIMESTAMP_NANO) type support for Iceberg V3�tables. * Add Variant type support for Iceberg V3, enabling semi-structured data�columns in Iceberg tables. * Eagerly collect delete files during split creation with improved logging�for easier debugging of Iceberg delete file resolution. * Improve IcebergSplitReader error handling and fix test file handle leaks. * Add end-to-end integration tests for Iceberg V3 covering snapshot�lifecycle (INSERT, DELETE with equality/positional/DV deletes, UPDATE,�MERGE, time-travel) and all 99 TPC-DS queries. Differential Revision: D97531547
apurva-meta
added a commit
to apurva-meta/presto
that referenced
this pull request
Mar 27, 2026
… for equality delete conflict resolution (prestodb#27392) Summary: Wire the dataSequenceNumber field from the Java Presto protocol to the C++ Velox connector layer, enabling server-side sequence number conflict resolution for equality delete files. Changes: - Add dataSequenceNumber field to IcebergSplit protocol (Java + C++) - Parse dataSequenceNumber in IcebergPrestoToVeloxConnector and pass it through HiveIcebergSplit to IcebergSplitReader - Add const qualifiers to local variables for code clarity == RELEASE NOTES == General Changes * Upgrade Apache Iceberg library from 1.10.0 to 1.10.1. Hive Connector Changes * Add Iceberg V3 deletion vector (DV) support using Puffin-encoded roaring�bitmaps, including a DV reader, writer, page sink, and compaction procedure. * Add Iceberg equality delete file reader with sequence number conflict�resolution per the Iceberg V2+ spec: equality deletes skip when�deleteFileSeqNum <= dataFileSeqNum; positional deletes and DVs skip when�deleteFileSeqNum < dataFileSeqNum; sequence number 0 (V1 legacy) never skips. * Wire dataSequenceNumber through the Presto protocol layer (Java → C++)�to enable server-side sequence number conflict resolution for all delete�file types. * Add PUFFIN file format support for deletion vector discovery, enabling�the coordinator to locate DV files during split creation. * Add Iceberg V3 deletion vector write path with DV page sink and�rewrite_delete_files compaction procedure for DV maintenance. * Add nanosecond timestamp (TIMESTAMP_NANO) type support for Iceberg V3�tables. * Add Variant type support for Iceberg V3, enabling semi-structured data�columns in Iceberg tables. * Eagerly collect delete files during split creation with improved logging�for easier debugging of Iceberg delete file resolution. * Improve IcebergSplitReader error handling and fix test file handle leaks. * Add end-to-end integration tests for Iceberg V3 covering snapshot�lifecycle (INSERT, DELETE with equality/positional/DV deletes, UPDATE,�MERGE, time-travel) and all 99 TPC-DS queries. Differential Revision: D97531547
…equality delete conflict resolution Summary: Wire the dataSequenceNumber field from the Java Presto protocol to the C++ Velox connector layer, enabling server-side sequence number conflict resolution for equality delete files. Changes: - Add dataSequenceNumber field to IcebergSplit protocol (Java + C++) - Parse dataSequenceNumber in IcebergPrestoToVeloxConnector and pass it through HiveIcebergSplit to IcebergSplitReader - Add const qualifiers to local variables for code clarity Differential Revision: D97531547
69bf631 to
987f9ba
Compare
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
Wire the dataSequenceNumber field from the Java Presto protocol to the
C++ Velox connector layer, enabling server-side sequence number conflict
resolution for equality delete files.
Changes:
through HiveIcebergSplit to IcebergSplitReader
Differential Revision: D97531547